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Abstract 

Cancer arises from successive rounds of mutations which generate tumor cells with different 
genomic variation i.e. clones. For drug responsiveness and therapeutics, it is necessary to identify 
the clones in tumor sample accurately. Many methods are developed to infer tumor heterogeneity 
by either computing cellular prevalence and tumor phylogeny or predicting genotype of mutations. 
All methods suffer some problems e.g. inaccurate computation of clonal frequencies, discarding 
clone specific genotypes etc. In the paper, we propose a method, called- HetFHMM to infer tumor 
heterogeneity by predicting clone specific genotypes and cellular prevalence. To infer clone specific 
genotype, we consider the presence of multiple mutations at any genomic location. We also tested 
our model on different simulated data. The results shows that HetFHMM outperforms recent 
methods which infer tumor heterogeneity. Therefore, HetFHMM is a novel approach in tumor 
heterogeneity research area. 


1 Introduction 

Cancer is a disease, caused by somatic mutations that accumulate in the genome during the lifetime 
of human El- Somatic mutations generate tumor cells with different genomic variation within tumor 
to form carcinoma i.e. cancer. This phenomena is known as Int.ra Tumor Heterogeneity mmmm 
[23 ■ In tumor heterogeneity, each type of cells with distinct genomic structure are known as clones [5] 
[12]■ Caraco et al. [T] identifies that drug responsiveness and cancer therapies are clone dependent. To 
develop more accurate drug and therapy, it is important to predict clonal subpopulations (i.e. clones) 
within tumor accurately. 

Each clone contains distinct genomic variation which is expressed by the precise order of nucleotides 
of DNA molecules, different from other clones m eh. To predict the order of nucleotides of DNA 
molecules, several techniques are developed. These techniques are known as DNA sequencing. Sanger 
sequencing [18] is the first approach to predict precise order of nucleotides of DNA molecules. This 
technique cannot produce complete nucleotide sequence of a DNA molecule. Later a new technique 
is developed to predict complete nucleotides sequence from DNA molecule called Next Generation 
Sequencing in short NGS. NGS produces many short sequences of nucleotides to predict complete 
nucleotide sequence of a DNA molecule. These short sequences are known as short reads. During 
DNA sequencing, some short reads are deleted or amplified by E. coli which is used for cloning DNA 
fragments. Therefore, sometimes short reads cannot produce complete nucleotide sequence of a DNA 
molecule. It is a challenging task to infer clones from short reads. 

Many researches and methods are developed to infer clones from short reads. These methods 
detect clones by either estimating cellular prevalenc^] and tumor phylogenjj^] or inferring genotypes 
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of all mutations. At first researchers develop many methods to infer genotypes of all mutations by 
considering genotype as the signature of a clone. GPHMM m and Apolloh [5] are the pioneer methods 
to infer clones from short reads. These methods detect only copy number variation (CNV) in form of 
genotypes to predict clones by assuming that only one clone exhibits one mutation which appears at 
a genomic location. Loss of heterogeneity (LOH^is not predicted together with CNV. TH-HMM [22] 
and CLImAT [2U detect LOH with CNV. But GPHMM [T3], Apolloh [5], TH-HMM [22] and CLImAT 
m do not consider the presence of multiple clone which do not exhibit a mutation at a location in 
tumor sample. This important feature is considered in OncoSNP-SEQ [23 and TITAN m to infer 
genotypes. Above mentioned methods do not work on presence of multiple mutations at any location. 

Roy et al. m predicts the phylogeny relation between clones by using multicolor lineage tracing 
method alias Brainbow. To identify the phylogeny, it is necessary to cluster mutations. TITAN m 
clusters all mutations into several classes. Other than TITAN [TD], PyClone [IB], PhyloSub [12], 
PhyloWGS [3j and Rec-BTP m identify clusters of mutations by computing and clustering cellular 
frequencies of all mutations. But these methods cannot identify the clones with same cellular prevalence 
separately. Like GPHMM m and Apolloh p5], these methods consider the presence of one mutation 
at a genomic location. 

In real world, many mutations can appear at any genomic location. There are several scenarios 
appear at any genomic location, (a). Some clones have same mutation which is harboured by their 
ancestor clone, (b) One clone exhibits a mutation which is identical from others. Genotype of a genomic 
location cannot reveal types of mutation which appear at that location. Clone specific genotype is a 
special genotype which express the genotype of a mutation harboured by a clone. Clonal signatur^] 
clusters of mutations and types of mutation of a location can be easily predicted by clone specific 
genotype. No existing method can detect clone specific mutation. This important feature is addressed 
in our proposed method- Het erogeneity Factorial Hidden Markov Model (HetFHMM). On the other 
hand, existing methods detect either cellular prevalence or genotypes. Genotype of a mutation and 
cellular prevalence combinely affect the count of short reads which are the inputs of existing methods. 
Our proposed HetFHMM infers clone specific genotypes and cellular prevalences together from the short 
reads to predict clonal subpopulations within tumor. It is a novel approach in tumor heterogeneity to 
predict clones within tumor. 

Our proposed HetFHMM is discussed in section [2] We tested our designed method with recent 
methods for simulated data. It is found that our method outperforms recent methods. We discuss our 
experiments and results in section [3] 


2 HetFHMM 

To infer clone specific genotypes and cellular prevalence from short reads, we develop a factorial hidden 
Markov model (FHMM) which is first proposed by Ghahramani et al. [5] . In FHMM, there are n 
number of chains. Each chain for individual object. Ghahramani et al. [5] first develops FHMM to 
identify the multiple simultaneous potential speakers from speech signals. In this FHMM, each chain 
is dedicated for each speaker. In our problem, tumor sample contains multiple clones whose genotypes 
and cellular prevalence would be different from each other. Like Ghahramani et al. [5], we predict 
clones by n chains using FHMM. In HetFHMM, we consider first chain i.e. chain 0 as normal cells. 
Rest of other chains represent the clones within tumor sample. 

In the model, genotype Gt, n of any clone n at ith location is one of the hidden variables. Duan et 
al. [Bj is found in their experiments that copy number varies upto 5. Copy number variation exhibits 21 
genotype states are shown in table-[2] In the model, we infer another latent variable- cellular prevalence 
vector $ which contains cellular prevalence of all clones of tumor sample. 

Our model takes the number of total short reads N t , reference read count^] at and log ratio of 
tumor-normal reads depth l t at any mutant location t as inputs. Reference read counts at follows 
binomial distribution with total sequence reads N t and Pf, t (G*, d>). The parameter Pf, t (Gt, $) computes 


3 Loss of heterozygosity (LOH) is a gross chromosomal event that results in loss of the entire gene and the surrounding 
chromosomal region. 

4 Expressed by genotypes 

5 Both alleles are missing at genomic location. 

6 One allele is missing at genomic location. 

7 The number of short reads which is matched with reference genome 
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Table 1: Genotype variable space 


copy number 

Genotype state 

Genotype 

Description 

0 

0 

0 

Nullizygou^j 

1 

1 

A 

hemizygou^j 

2 

B 

2 

3 

AA 

Copy neutral with LOH 

4 

AB 

Normal copy 

5 

BB 

Copy neutral with LOH 

3 

6 

AAA 

Three copies with LOH 

7 

AAB 

Three copies with duplication of A allele 

8 

ABB 

Three copies with duplication of B allele 

9 

BBB 

Three copies with LOH 

4 

10 

AAAA 

Four copies with LOH 

11 

AAAB 

Four copies with duplication of A allele 

12 

AABB 

Four copies with duplication of both alleles 

13 

ABBB 

Four copies with duplication of B allele 

14 

BBBB 

Four copies with LOH 

4 

15 

AAAAA 

Five copies with LOH 

16 

AAAAB 

Five copies with duplication of A allele 

17 

AAABB 

Five copies with duplication of both alleles 

18 

AABBB 

19 

ABBBB 

Five copies with duplication of B allele 

20 

BBBBB 

Five copies with LOH 


probability of reference allele, as follows: 


Ph, (G t , <k) 


zto 


gt,k - c gt,k 
k- c gt,k 


(1) 


where K , r gt k and c gt k are the number of clones, reference allele raticj^] and copy number of clone k 
at mutant location t respectively. In the model we assume that log ratio of tumor-normal read depth 
is gaussian distributed with mean nit(Gt, $, 0 ) and standard deviation er. The number of short reads 
of tumor and normal cells are very large. As per central limit theorem, all large data are normally 
distributed. For this reason we consider log ratio of tumor-normal read depth is gaussian distributed. 
Mean of log ratio of tumor-normal read depth is: 


Pt(G t , $, o) 


T K 

l^k=C 


4>k-Cg 


& 0 - c g t ,o- Zk =1 tyk-O 


( 2 ) 


Where o be tumor ploidy parameter which is set to 3. In addition, our proposed model can work on 
multiple samples x £ X of same patient. Probabilistic graphical model of HetFHMM is shown in fig-[T| 
Factorial Hidden Markov Model contains transition and emission probabilities. Like FHMM, tran¬ 
sition probability of HetFHMM has been expressed as P(G t ,h = i\Gt-i,k = j ) = A t fc(?',j). We employ 
transition probability from the study done by Colella et al. jS] which captures the position specific effects 
accurately, is as followed: 

P(G t ,k = i\Gt-i,k = j) = ^ 

At 

where 

Pt = 1 - 3 ( 1 - e ^) 

L, dt and Z\. be the average length of the sequence read^] dimension of state space (in this model it is 
21) and distance between the mutant locations t and t— 1 in clone k respectively. Emission probabilities 
for generating observation based on hidden variables Gt,k an d fk are as followed: 


Pt 

1 -pt 

D k -1 


if 1 j 
Otherwise 


(3) 


P(a t \N t ,G t ,^) = Bin(a t \N t , P bt (G t , <&)) (4) 

P(l t \cr,(l>,Gt,$) = J\f(l t \nt{Gt,$,o),a) (5) 


A 


8 r g± k = Aat ’ k where A g± k is the number of A allele of any genotype Gt For an example, if genotype is AAB, 

9t,k 

9t,k = 2 - C 3t,k = 3 and r 3t,k = | 

9 It was observed to be 2 Megabases (2 x 10 6 bases) in 104 breast tumors (rounded to the nearestMb.) 


3 



































Figure 1: Probabilistic graphical model of HetFHMM 


In order to infer the hidden variables G and <f>, we use negative logarithm of likelihood function to 
pursue G and >I> which give the highest joint probability of the model. 

K T 

P(G,l,a,$\N,) = nn A t (G t , k \G t - ltk ) 

k—0 t— 0 
T 

nn Bin(a t>x \N ttX , P bt x (G t , ^))N{l t ^ tiX {G u <f>, o), a) (6) 

x€X t—0 

Since the exact inference for FHMM is intractable 0 , we use a sampling method to get an approxima¬ 
tion. Markov chain Monte Carlo (MCMC) sampling is widely adopted for this task. Gibbs sample in 
one of the simple sampling schemes among MCMC methods. In order to run Gibbs sampling, we need 
to start with an initial model in which all genotypes are randomly generated. Except for the normal 
chain, we randomly choose a genotype for each variable based on the uniform distribution. Then, each 
hidden variable is sampled given the current state of rest of the variables. In our case, the probability 
of each genotype for a hidden variable Gt,k is : 

P(Gt,k) cx -A t (Gt t k\Gt-i,k)-At-t-i(Gt t k\Gt-i,k)Pin(Qt,x\Nt iX , Pb t ^)N c) (7) 

We sample each current state given another four variables: the previous genotype Gt- i, the next 
genotype Gt+i and two observations l t and a*. 

Having fixed all the values for genotype variables G, the negative log likelihood can be minimized 
using Exponentiated Gradient Descent, which is similar to normal Gradient Descent. Gradient Descent 
is an algorithm to find the minimum of a given objective function (target function) by gradually 
approaching the minimum along the direction of the gradient function. Exponentiated Gradient (EG) 
[?] algorithm is a variant of normal Gradient Descent. The difference is that the update for Gradient 
Descent is to subtract the gradient of a target function, where as in EG the update is done by multiplying 
the exponents of the negative gradient. So in EG we pursue: 

max—£($) ( 8 ) 

that is (j>k = 1 and </>* > 0 where C denotes the objective function. In addition, it is proved to 
perform better when the target is sparse. In other words, it allows us to identify clones even if it 
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contains only a very small proportion of cells. In our case, the objective function is the log likelihood 
function of the model, i.e. £(<!>) = log(P(G, l, a , <f>|TV, o)). To solve the above maximization problem, 
the EG updates are as follows: 

cf)™ w =( j) k e-^ v ^ c ^ (9) 

where 77 is the learning rate. After updating each component of the latent vector $, the values are 
normalized so that they sum to one. In our model, for the EG updates, we need the derivatives which 
are derived using the chain rule, as follows: 


£($) = X! lo S ) + at lo S P bt + ( N t - at) log(l - P bt ) 


+ log(^=)- (Zt ^ 


t\[ 7 2/k 


2 a 2 


const 


( 10 ) 


V£ W = 51 Kir - TTfr) • v » + ■ ml dU 

I * b t 1 & 


dPbt _ Cf.k ' ( t'i.,h Pbt) 


dfik 

Efc =0 &k ' °t,k 

d/i t 

Ct ,0 • (1 - th) 

d(f>o 

<^o ■ ct, 0 + yijt. ^ 'O 

d/J. t __ 

Ct ,0 • out 


d(j>k 4>0 ‘ Ct ,0 + ' 4 1 


Substitute and back to £(<!>), we can get the gradient of the objective function with respect 
to variable $. 

It is known that Gibbs sampling stops when the convergence criteria is met. In our case, we 
define the number of times that each genotype and cellular prevalence of n clones are sampled as the 
convergence criteria. 


3 Experiments and results 

We carry out several experiments to HetFHMM on simulated data to evaluate its inference of clone 
specific genotypes at any mutant location including cellular prevalence. We generate simulated data 
from factorial Hidden Markov model with the number of clones varies from 3, 4, 5 and 6 . The cellular 
prevalence vector $ is specified before generating FHMM. For each FHMM, we select the number of 
mutant location by random which varies from lk to 10k. After selecting the location, first we proceed 
to chain 0 which represents as normal cells with genotype AB. We assume that hidden variables for 
all chains at first location are also AB. Next, the genotypes of each clone at all mutant locations are 
generated using equation-|3] We set the expected probability of reference allele P b and the mean /u t of 
It by using equations-[ 2 ] and |T] respectively. Finally we generate observation variables: reference allele 
read counts and log ratio of tumor-normal contents by random production of total read counts which 
varies from lk 10 k. 

We implement our model HetFHMM on these simulated data. For comparing our model with recent 
methods, we also implement PyClone |TB], PhyloSub [fjj and Rec-BTP [TT] on these data. HetFHMM 
gives two outputs: cellular prevalence and clone specific genotypes. Whereas PyClone [T5], PhyloSub 
[12] and Rec-BTP |TTj predict cellular prevalence and clusters of all mutations. No method infers clone 
specific genotypes from the short reads. Clusters of all mutations can be determined from clone specific 
genotypes. Similar types of output of HetFHMM are also generated by PyClone Pj5], PhyloSub |T2] 
and Rec-BTP [TT]. For this reason we compare the results of HetFHMM with PyClone [T6], PhyloSub 
HU and Rec-BTP \YT\. 

We use V-measure to assess the clusters of all mutation that are either generated from clone 
specific genotypes or PyClone [16] . PhyloSub [T5] and Rec-BTP [TTj with gold standard. V-measure is 
an entropy-based external cluster validation measure to evaluate predicted clusters with respect to gold 
standard classes. It measures how successfully the criteria of homogeneityp’] and completenesEI have 

10 Each cluster contains only members of a single class m- 

11 All members of a given class are assigned to the same cluster m- 
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been satisfied. V-measure is the only external entropy based measure which successfully and efficiently 
evaluate cluster prediction than any other cluster evaluation measures. It is also used in PyClone to 
evaluate its clustering methodology. Rosenberg et al. [15] defined homogeneity as followed: 


h = 1 — 


H(N\n) 


H(N) 

where H(N\n ) is the conditional entropy of the class N given the cluster n , as followed: 

k c 

H(N\n) = - EE 


^ck i C^ck 


k—1 c—1 


M 


E c =i ack 


(15) 


and H(N ) is the entropy of the class N, as followed: 


H(N) 


E Mk=l Q cfc , Mk—1 a ck 
M & M 

C—1 


C, K, M, N, and n and a Ct k are the number of gold standard clones, the number of predicted 
clones, the number of mutations, a gold standard clone, a predicted clone and the number of mutations 
that are members of gold standard clones c and predicted clones k. They also defined the completeness 
as followed: 


H(n\N) 

H(n) 


(16) 


where H(n\N) is the conditional entropy of the cluster n given the class N, as followed: 


H(n\N) 




tick 


c=l k—1 


2-,k =i 


and H{n) is the entropy of the cluster N, as followed: 


H(n) 


K \~^C 

E M C =1 a ck , Z^c= 1 Q ck 

M & M 

k —1 


V-measure is defined as the harmonic mean of distinct homogeneity and completeness scores by 
Rosenberg et al. m : 

t 2 xhxc 

V-measure = — - - (17) 

h + c 

The value of V-measure ranges from 0 to 1. If its value is 1, it means perfect matching; otherwise 
not. For assessing our inferred clone specific genotype with the clusters of mutations by V-measure, 
we generate a two dimensional matrix in which if any mutation t is present in any clone k 

means the genotype of t mutation in clone k is other than AB, Mj2 t = 1; Otherwise zero. We also 
produce similar matrices M^ xT , M|? xT , M£ xT and M^ xT for gold standard, Rec-BTP [11] . PyClone 
m and PhyloSub m respectively. Generated genotypes that are used to produce simulated data, are 
considered as gold standard. 

Using these matrices, we compute V-measure of the clustered outputs of HetFHMM, PyClone |16| . 
PhyloSub U2] and Rec-BTP [TTj with respect to gold standard. The cluster validation result is shown 
in fig--[2j It clearly shows that HetFHMM outperforms Rec-BTP [IT], PyClone [TBj and PhyloSub [ T2] , 
PyClone [15], PhyloSub |12| and Rec-BTP [IT] produce more clones than original number of clones 
which affects the completeness and homogeneity of V-measure. More over, from fig.-[2j it is also found 
that increasing number of clones inversely affects the V-measure of each method. 

We evaluate the clonal frequencies by using RMSD value after examining the clusters of mutation. 
Root mean square distance or RMSD is used to compute the distance/ gap between cellular prevalence 
of predicted clones and gold standard clones. Many predicted clones contains some of mutations of a 
gold standard clone. Among them, a predicted clone would contain maximal number of mutations of 
gold standard clones. This predicted clone is called significant clone of a gold standard clone. This is 
why, clonal frequencies of all predicted clones are not compared with gold standard. Before computing 
RMSD value, we obtain significant clones from predicted clones. For detecting significant clones, we 
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Figure 2: V-measure result comparison among HetFHMM , Rec-BTP , PyClone and PhyloSub algo¬ 
rithms for 3, 4, 5 and 6 clones synthetic dataset. 


set a value called threshold. If V-measure of any predicted clone n and gold standard clone N is above 
threshold, the predicted clone n is considered as significant clone of gold standard clone N. RMSD 
value is calculated by: 

RMSD = y I ^ Sv significant clones 11^™ — &n\\ 2 

if n is significant clone of N, detected by V-measure 

S be the number of significant predicted clones. The zero value of RMSD indicates the perfect 
computation of clonal frequencies. But, most of the cases, RMSD value of different methods changes 
significantly. For this reason, we use log RMSD instead of RMSD. log RMSD value which helps us 
to understand the cellular prevalence comparing results better than RMSD. 



3 Clones 4 Clones 5 Clones 6 Clones 


11 HetFHMM PhyloSub 11 PyClone 11 Rec-BTP 


Figure 3: log RMSD comparison among HetFHMM , PhyloSub , PyClone and Rec-BTP algorithms for 
3, 4, 5 and 6 clones synthetic dataset. 

In cellular prevalence evaluation, we set the value of threshold to 0.25. The comparative result of 
log RMSD of HetFHMM, Rec-BTP [TT|, PyClone |TB] and PhyloSub [T2] are shown in fig.-[3] We find 
that HetFHMM produces less log RMSD value compare to Rec-BTP [TT], PyClone [TBJ and PhyloSub 
|12| . The V-measure comparison shows that mutation clustering by Rec-BTP m, PyClone [16] and 
PhyloSub [T2[ are not more accurate than HetFHMM. It affects the number of mutations in a clone 
which affects the cellular prevalence of recent methods. 

4 Conclusion 

HetFHMM can detect clones by genotypes and cellular prevalence together. We develop HetFHMM 
by considering the presence of multiple clones and mutations at any genomic location which helps 
to find the more accurate clones and their cellular prevalence than existing methods: PyClone m, 
PhyloSub PU and Rec-BTP |lTj. Therefore, it can be said that HetFHMM is a novel approach in 
tumor heterogeneity research to infer clones. 


7 





















Despite improved performance of HetFHMM, it has some issues. We consider any mutation is 
dependant on previous mutation. In the real world, all previous mutations have position specific affects 
on one mutation which is not considered in this model. The number of clones within a tumor sample 
is predefined, but it needs to be predicted from tumor sample. On the other hand, we experiment our 
model only on simulated data. We have a plan in future to test our model on real large tumor sample. 
We also improve our HetFHMM by considering above mentioned important features. 
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